Data Lake
15 items in Data Lake
Projects
Blog Posts
Iceberg Series, Part 6: Multi-Engine & Maintenance
Querying Iceberg from Trino, Flink, and DuckDB; expiring snapshots; rewriting data files; and keeping Iceberg tables healthy in production.
Iceberg Series, Part 5: Row-Level Operations
How MERGE, UPDATE, and DELETE work in Iceberg — copy-on-write vs merge-on-read, when to use each, and the performance trade-offs.
Iceberg Series, Part 4: Hidden Partitioning & Evolution
Partition transforms that derive partition values automatically, partition evolution that changes strategy without rewriting data, and why these are Iceberg's biggest ergonomic wins.
Iceberg Series, Part 3: Catalogs
How Hive, Glue, REST, and Nessie catalogs coordinate multi-engine access to Iceberg tables — and why the catalog abstraction is Iceberg's biggest differentiator.
Iceberg Series, Part 2: Table Format Internals
The four-layer metadata hierarchy — table metadata, manifest lists, manifest files, and data files — and how it enables efficient scans and snapshot isolation.
Iceberg Series, Part 1: Getting Started
Creating Iceberg tables with Spark, reading and writing data, running MERGE, using time travel, and inspecting table history.
Iceberg Series, Part 0: Overview
What Apache Iceberg is, how it differs from Delta Lake and Hudi, and why multi-engine interoperability is its defining advantage.
Delta Lake Series, Part 6: Streaming & CDC
Writing to Delta with Structured Streaming, exactly-once guarantees, reading Delta as a stream, and Change Data Feed for downstream propagation.
Delta Lake Series, Part 5: Performance Optimization
Making Delta Lake queries fast — OPTIMIZE, Z-ordering, data skipping with column statistics, compaction, and partitioning strategies.
Delta Lake Series, Part 4: Time Travel & Versioning
Querying historical snapshots by version or timestamp, rolling back bad writes, auditing the table history, and managing retention with VACUUM.
Delta Lake Series, Part 3: Schema Enforcement & Evolution
How Delta Lake validates schemas on write, rejects incompatible data, and handles controlled schema changes over time.
Delta Lake Series, Part 2: Transaction Log & ACID
How the Delta Lake transaction log enables atomicity, serializable isolation, optimistic concurrency, and conflict resolution.
Delta Lake Series, Part 1: Getting Started
Creating Delta tables, reading and writing with Spark, Delta SQL, and what the _delta_log looks like in practice.
Delta Lake Series, Part 0: Overview
The data lake reliability problem, what Delta Lake adds on top of Parquet, and how it compares to Apache Iceberg and Apache Hudi.